cmp-lg/9405001 2 May 1994 Similarity-Based Estimation of Word Cooccurrence Probabilities
Abstract
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz’s back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-
Similar resources
arXiv:cmp-lg/9405001v1 2 May 1994 Similarity-Based Estimation of Word Cooccurrence Probabilities
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the n...
arXiv:alg-geom/9502026v2 9 May 1995 ALGEBRAIC SURFACES AND SEIBERG-WITTEN INVARIANTS
arXiv:cmp-lg/9605014v1 12 May 1996 Clustering Words with the MDL Principle
We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compar...
Spectral Duality for Planar Billiards
ao-dyn/9405001 2 May 1994 Spectral Duality for Planar Billiards. J.-P. Eckmann (1,2) and C.-A. Pillet (1). (1) Dépt. de Physique Théorique, Université de Genève, CH-1211 Genève 4, Switzerland; (2) Section de Mathématiques, Université de Genève, CH-1211 Genève 4, Switzerland. Abstract. For a bounded open domain with connected complement in R^2 and piecewise smooth boundary, we consider the Dirichlet Lapla...
arXiv:q-alg/9705027v1 28 May 1997 Jordanian U_{h,s}gl(2) and its coloured realization
A two-parametric non-standard (Jordanian) deformation of the Lie algebra gl(2) is constructed and then used to obtain a new, triangular R-matrix solution of the coloured Yang-Baxter equation. The corresponding coloured quantum group is presented explicitly.